Comprehensive framework of GPU-accelerated image reconstruction for photoacoustic computed tomography

Abstract. Significance Photoacoustic computed tomography (PACT) is a promising non-invasive imaging technique for both life science and clinical implementations. To achieve fast imaging speed, modern PACT systems have equipped arrays that have hundreds to thousands of ultrasound transducer (UST) elements, and the element number continues to increase. However, large number of UST elements with parallel data acquisition could generate a massive data size, making it very challenging to realize fast image reconstruction. Although several research groups have developed GPU-accelerated method for PACT, there lacks an explicit and feasible step-by-step description of GPU-based algorithms for various hardware platforms. Aim In this study, we propose a comprehensive framework for developing GPU-accelerated PACT image reconstruction (GPU-accelerated photoacoustic computed tomography), to help the research community to grasp this advanced image reconstruction method. Approach We leverage widely accessible open-source parallel computing tools, including Python multiprocessing-based parallelism, Taichi Lang for Python, CUDA, and possible other backends. We demonstrate that our framework promotes significant performance of PACT reconstruction, enabling faster analysis and real-time applications. Besides, we also described how to realize parallel computing on various hardware configurations, including multicore CPU, single GPU, and multiple GPUs platform. Results Notably, our framework can achieve an effective rate of ∼871 times when reconstructing extremely large-scale three-dimensional PACT images on a dual-GPU platform compared to a 24-core workstation CPU. In this paper, we share example codes via GitHub. Conclusions Our approach allows for easy adoption and adaptation by the research community, fostering implementations of PACT for both life science and medicine.

1 Introduction into native instructions for GPU or CPU.A researcher familiar with Python can easily use Taichi and, by extension, our framework code.This is especially true for PACT reconstruction, which frequently utilizes delay-and-sum (DAS) type algorithms and which primarily requires extensive indexing and vector summation operations, where vector summation's most common parallel computing optimization technique is reduction.Taichi is capable of automatically implementing reduction optimizations via its compiler, achieving performance, such as highly optimized CUDA code, which is one of the main optimizations our framework provides.Besides, some other Python computation libraries, such as CuPy, can only implement more complex operations by writing CUDA code, making it challenging to be considered a high-performance programming library that is fully Pythonfronted.The proposed GAPAT offers significant performance improvements for PACT image reconstruction, enabling faster analysis and real-time applications.By using the Taichi library, which abstracts away many of the complexities of parallel programming and enables developers to focus on writing their reconstruction algorithms, reconstruction programs can be feasibly adapted to run efficiently on a wide range of modern hardware platforms from CPU to GPU.In addition, the GAPAT can also extend image reconstruction algorithms to large-scale workstations and servers with multiple GPUs to use all computing resources of hardware.

Materials and Methods
Figure 1 describes the overall framework of GAPAT for developing GPU-accelerated PACT image reconstruction using open-source parallel computing tools.The framework consists of four main steps: data preprocessing, algorithm implementation, space-separation reconstruction, and data postprocessing.In the first step, we explain how to set the reconstruction system with config.yamlfile and import data efficiently; in the second and third steps, we explain how to use Taichi and Python multiprocessing to perform parallel computing on different hardware platforms, especially multiple GPUs.In the last step, we perform some postprocessing to enhance the image quality and improve visualization of the target structure where you can use various Python image processing libraries, such as OpenCV, in our framework.

Reconstruction parameter settings
Currently, the PACT raw data primarily originate from two sources: simulation data and real experimental data stored in the data acquisition (DAQ) devices.Our framework uses a config.yamlfile to extract the required parameters.YAML is a human-readable data serialization format.It is often used for configuration files and data exchange between languages with different data structures.Consequently, adjustments to parameters only necessitate changes to the config.yamltext file, facilitating parameter modification and source code exchange.In Table 1, the raw data are from a synthetic planar array by scanning a linear array of 256 elements at 1380 steps, equivalent to a large-scale matrix array with the size of 256 × 1380 elements. 34

Data import and transformation
After reading the parameters from the config.yamlfile, we proceed to import the raw data and complete the transforming process.Some parameters are necessary for the rapid importation of large-scale raw data, enabling us to utilize matrix operation functions in existing powerful numerical computation libraries to process multiple raw data files.This approach avoids the inefficiency of double loops and the repetitive operations often required when preloading a sub-file to obtain related data parameters in traditional methods.As a result, for example, GAPAT can read 1380 raw data files, each containing data collected by 256 detectors in a linear array with a length of 2048 time points, in just 2.2 s, from the in vivo experiments described in Sec.3.2, with specific hardware parameters detailed in Sec.3.1.Wang and Li: Comprehensive framework of GPU-accelerated image reconstruction. . .

Data preprocessing
Prior to the reconstruction process, the raw PA signals must be preprocessed.In our framework, data preprocessing steps may include: (1) bandpass filtering, (2) data resampling, and (3) specific preprocessing for algorithms.For instance, in the DAS algorithm provided within our framework, we set the first and last signal values of all original signals to zero, enabling efficient handling of out-of-bounds indices.Due to the efficient and compact data organization form adopted by the reshaped raw data, various data processing methods can be easily executed in parallel through matrix operations.This not only simplifies usage but also greatly enhances its extensibility.

Back-Projection Algorithm in GAPAT
There are two commonly used time-domain back-projection algorithms, the delay and sum (DAS) and filtered back-projection 11 (FBP), for PACT.In our work, we implemented both algorithms in demonstration codes.The DAS is a relatively simple method, which is described as the following equation: ; t e m p : i n t r a l i n k -; e 0 0 1 ; 1 1 7 ; 5 4 9 p 0 ðrÞ ¼ Noting that the ΔΩ i is the stereo angle of the i'th detector to the reconstruction point and that di is the position vector of i'th detector relative to the coordinate origin, p 0 ðrÞ is the initial sound pressure at the reconstruction point with position vector r, p i ðtÞ represents the signal value of the i'th detector at delay t, and c is the speed of sound.
The FBP algorithm involves one more term of temporal derivative of PA signal, as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 1 1 7 ; 4 3 4 In a large planar detection geometry, the normal direction of each UST is the same.Calculating the stereo angle, we can get that for a planar detector array, ignoring constant coefficients, the discrete equation becomes E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 1 1 7 ; 3 4 4 DAS∶ p 0 ðjÞ ¼ E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 1 1 7 ; 2 9 0 FBP∶ p 0 ðjÞ ¼ Here, z j represents the z-coordinate of the j'th reconstruction point and l ij represents the distance from the i'th detector to the j'th reconstruction point.
Currently, both algorithms have been implemented in our framework using parallel loops that iterate over the spatial grid points and the UST elements with Taichi Lang for Python.Specifically, we can implement a kernel in Python that executes the back-projection algorithm based on Taichi's syntax.The Taichi compiler will then automatically apply various optimization techniques to generate highly optimized machine code for the target hardware.All one needs to do is to decorate the corresponding reconstruction functions as Taichi kernels.This flexibility in our framework allows the rapid development and testing of reconstruction algorithms while maintaining high computational performance.

Space-Separation Reconstruction Strategy of GAPAT
Despite Taichi kernels' application, enabling multi-GPU PACT reconstruction remains challenging.This feature is not supported by Taichi kernels, and its implementation in CUDA C++ can be quite complicated.GAPAT offers a much easier solution based on Python multiprocessing, which creates asynchronous processes for each device (usually GPU), assigns subtasks, and integrates results upon completion.Then, we aim to divide the PACT reconstruction task into multiple subtasks for independent devices.In general, there are two approaches: partitioning either the detector array or the reconstruction space.The former faces issues with memory allocation in large or high-resolution spaces.Thus, we adopt the latter approach, dividing the reconstruction space into subspace regions at different depths, allowing efficient GPU allocation and memory management.This also enables depth-specific processing for improved reconstruction results.Specifically, taking a planar matrix detector array as an example, as shown in Algorithm 1, we divide the reconstruction space into subspace regions with different depths.

Image Postprocessing
After the PACT reconstruction process, postprocessing is often performed to enhance the image quality, reduce noise, and improve visualization of results.GAPAT has provided some common postprocessing techniques for PACT, including:

Negative value processing
Negative values in PACT reconstructed images can arise due to various factors, such as noise, artifacts, or limited view and limited bandwidth.Here are some of the methods GAPAT provides: (1) Absolute value: applying the absolute value function to the PACT reconstructed image.
(2) Squaring: squaring the values in the reconstructed image.(3) Hilbert transform: 35 the Hilbert transform is a linear operator that can be applied to PACT reconstructed images to obtain an analytic signal allowing for getting the envelope of the PA signal.

Visualization
Enhance the visual representation of PACT images to facilitate interpretation and communication of the results.Visualization techniques currently in GAPAT mainly include volume rendering realized by OpenCV.The postprocessing techniques can be implemented using a variety of Read all files in data directory into a signal matrix S

4.
Get the detector locations into a matrix D

5.
Initialize signal reconstruction matrix I

6.
Define kernel function that performs FBP algorithm image processing and analysis software, such as MATLAB or some other Python libraries due to the powerful extensibility of our framework.

Results
Both simulated and experimental datasets were used to test and evaluate GAPAT.The simulated dataset was generated using a numerical phantom with solution of theoretical formula of photoacoustic wave function, while the experimental dataset was acquired from in vivo PACT imaging of a human forearm.Both datasets were tested with multiple hardware configurations, including multicore CPU, single GPU, and multiple GPU platforms.Depending on the available hardware resources, users can choose the most suitable configuration.

Simulation Results
In this study, we selected a PACT system using a hemispherical detector matrix array, which is an important setup for real-time 3D PACT.The hemispherical array has a radius of 100 mm with 1024 detectors.Considering the potential disadvantages of rotational symmetry on imaging, our simulation employed a spherical detector array arranged in a so-called "Fibonacci" pattern, 36 as shown in Fig. 2.
Our simulation assumed each array element has an infinite wide bandwidth and infinitesimal point size.For the objects to be imaged, we chose a small sphere with a radius of 1 mm located at the center of the array, and the imaging field of view was set to a 5 mm × 5 mm × 5 mm spatial region near the center of the sphere.The sampling frequency in simulation is 40 MHz.Our hardware setup consists of an AMD Threadripper 3960X CPU, featuring 24 cores and 48 threads, 64 GB DDR4 memory, 2 NVIDIA GeForce RTX 3090 Ti 24 GB graphics cards.We chose the FBP algorithm to reconstruct the 3D image.
In this simulation, we compared the performance of three computational setups on various grid sizes: CPU parallel, single GPU, and dual GPU.Table 2 presents the time costs for each setup, respectively.It is essential to emphasize that while timing the execution of algorithms, we also include the time taken to read the original data from local files and save the reconstructed results back to the local storage for 3D PACT reconstructions that deal with a large volume of data.The observation that dual GPU configurations sometimes show less efficiency compared to a single GPU for smaller grid sizes is attributed to the overhead associated with managing parallel tasks across multiple GPUs.This overhead becomes negligible as the grid size increases.In summary, the dual GPU configuration offered the most efficient performance in terms of time cost, outpacing both the CPU parallel and single GPU setups.
Fig. 2 Detectors arranged in a Fibonacci array on a hemisphere.

In Vivo Experimental Results
Our in vivo experiment data are from a PACT imaging of arm using a synthetic planar array. 34,37ur system employs a Q-switched Nd: YAG pulsed laser (LS-2137/2, LOTIS TII Ltd., Belarus) at 1064 nm for photoacoustic signal excitation.This laser produces pulses with a duration of ∼16 ns and has a repetition rate of 10 Hz.A non-focusing linear array consisting of 256 elements was used.The in vivo experiments utilized a data acquisition system (Marsonics DAQ, Tianjin Langyuan Technology Co., Ltd.China) at 40 MHz sampling rate.The raw data, in the form of int16, underwent a conversion process to match our framework's input requirements, in the form of float32, ensuring compatibility and efficient processing.During imaging, we scanned the linear array above the arm over a total length of 138 mm at a scanning step of 0.1 mm, as shown in Fig. 3.In this case, an equivalent matrix array consists of ∼353 k elements (256 × 1380) is realized.At each detection each array element captures signal data containing 2048 time points.
In reconstructing the arm data of the following results, we use the DAS algorithm instead of the FBP algorithm for reconstruction to achieve higher robustness.Our framework, employing the DAS algorithm, incorporates the solid-angle factor consideration for planar arrays, as detailed in Eq. ( 3).In the case of planar arrays, the line connecting the detector and the reconstruction point often does not coincide with the normal vector of the detector array plane.It is necessary to multiply by the cosine of the angle difference to correct the signal during DAS reconstruction.
We used the same hardware setup as that in the simulation study.Table 3 presents a comparison of the computational performance of three different setups-CPU parallel processing, single GPU, and dual GPU configurations-in PA imaging reconstructions using grid sizes of 240 × 240 × 80, 600 × 600 × 200, and 1200 × 1200 × 400, and Fig. 4 shows the MIP image reconstructed by single GPU and 1200 × 1200 × 400 grid.
The dual GPU setup still outperformed the other configurations.In summary, GAPAT can achieve an acceleration rate of ∼871 times when reconstructing extremely large-scale 3D PACT images on a dual-GPU platform compared to a 24-core workstation CPU, 116 times faster than k-Wave on a same CPU, and on this largest grids, the GPU-based algorithm in k-Wave directly  exhausts the GPU memory and becomes inoperable, which highlights one of the major advantages of our method: more economical and efficient use of GPU memory.In addition to the above, we also compared the performance of GAPAT with a CUDA C++ coded program for this reconstruction task with a grid size of 600 × 600 × 200.The results indicate that, on this scale, the compiled CUDA C++ program takes 318 s (on a single GPU).As shown in Table 3, compared to the CUDA C++ program, our framework achieves a performance improvement of 3.2 times for a single GPU setup and 5.8 times for a dual GPU setup, respectively.The performance of GAPAT often surpasses that of manually coded CUDA C++ programs written by average developers.It is worth mentioning that GAPAT significantly reduces the difficulty of developing multi-GPU parallel programs compared to CUDA C++, which is crucial for 3D image reconstruction by large-scale PACT array systems.
Besides DAS, we also provide source code based on the FBP algorithm using GPU, the results of which can be found in the README.mdfile of our GitHub repository described in Sec. 4. According to our results, the optimized GPU-based FBP algorithm is only ∼5% slower on average compared to its DAS counterpart.

Discussion and Conclusion
Our above experiment results indicate that the GPU code of k-Wave is currently insufficient for large-scale 3D PACT reconstruction tasks due to its substantial memory requirements.Moreover, there has been a lack of updates and maintenance for these codes.The last update to k-Wave's GPU code was made on February 28, 2020, and compatibility issues have been identified when utilizing the latest hardware.In addition, the recent published PACT toolbox PATATO 38 (a Python photoacoustic tomography analysis toolkit) has garnered attention.However, since  it employs Google's JAX library for GPU acceleration, it only supports GPU on Linux platforms and faces the same challenges of underutilization of GPU memory compared to GAPAT.Given these issues, using PATATO may not be friendly for developers who wish to write their own GPU-accelerated new PACT reconstruction algorithms.Using the in vivo experimental data from Sec. 3.2 (with original data dimensions of 256 × 1380 × 2048), Table 4 still demonstrates the superiority of GAPAT.
In practical applications, more scanning geometries exist beyond the spherical and planar geometries discussed in this work.To meet challenges in assigning normal directions of complex geometries within our framework, several strategies can be proposed: (1) user-defined normal direction: we could augment our framework to allow users to define the normal direction for each detector element manually.(2) Parametric representation: For certain complex geometries, a parametric representation might be used to describe the detector positions and orientations.
(3) Hybrid approaches: for particularly intricate geometries, a hybrid approach might be necessary.
By directly interfacing with the data transmitted from the DAQ devices to the host memory, GAPAT can be adapted to facilitate real-time visualization of imaging results.This real-time capability necessitates certain modifications to the original data.Specifically, it requires a transformation and reordering process to ensure that each row of data distinctly represents the readings from an individual detector.It should be noted, however, that realizing this functionality often hinges upon the support provided by the DAQ manufacturer.Their proprietary SDKs (Software Development Kits) typically offer specialized methods for data manipulation and extraction.
In recent years, machine learning techniques have shown great potential for improving image reconstruction in various medical imaging modalities.Our framework can be extended to incorporate machine learning-based reconstruction algorithms, such as convolutional neural networks and recurrent neural networks, to further enhance the reconstruction performance and can keep the same hardware setup.By leveraging the massive parallelism of GPUs, our framework can efficiently implement machine learning-based algorithms, allowing for faster and more accurate PACT image reconstruction.
To facilitate the adaptation of our framework by the research community, we release the source code and documents used in this work under an open-source license.The code of a demo version of GAPAT is publicly accessible from the following GitHub repository: https://github .com/ddffwyb/GAPAT.It provides installation and usage instructions, as well as a set of sample data for execution.This will enable researchers to customize the framework to meet their specific needs, as well as to contribute improvements and new features that can benefit the entire community.
In conclusion, our comprehensive framework for GPU-accelerated PACT reconstruction aims to address computational challenges associated with PACT image reconstruction, offering a flexible and high-performance solution that can be tailored to different hardware configurations.By promoting faster analysis and applications, our framework contributes to the advancement of PACT imaging and its broader adoption in science research and clinical settings.

Algorithm 1
Space-separation reconstruction of FBP algorithm Input: F : config file, N: number of devices Output: I: reconstructed PACT image 1. Load configuration file F and set up all parameters 2. Start timing 3.

Fig. 3
Fig. 3 Schematic diagram of the scanning of linear array to form a synthetic planar array.

Fig. 4
Fig.4Reconstructed maximum intensity projection (MIP) image of the arm using GAPAT by 1200 × 1200 × 400 grids.Data that have been linearly normalized to fill the 0-1 interval.

Table 2
Simulation results with different grids.

Table 3
In vivo experimental results with different grids.